Abstract

Dynamic Bayesian networks (DBNs) offer an ele- 
gant way to integrate various aspects of language 
in one model. Many existing algorithms developed 
for learning and inference in DBNs are applicable 
to probabilistic language modeling. To demonstrate 
the potential of DBNs for natural language process- 
ing, we employ a DBN in an information extraction 
task. We show how to assemble a wealth of emerging
linguistic instruments for shallow parsing, syntactic
and semantic tagging, morphological decomposition,
named entity recognition, etc., in order to
incrementally build a robust information extraction 
system. Our method outperforms previously pub- 
lished results on an established benchmark domain. 



1 Information Extraction 

Information extraction (IE) is the task of filling in tem- 
plate information from previously unseen text which be- 
longs to a pre-defined domain. The resulting database 
is suited for formal queries and filtering. IE systems 
generally work by detecting patterns in the text that 
help identify significant information. Researchers have 
shown [Freitag and McCallum, 1999; Ray and Craven, 2001]
that a probabilistic approach allows the construction of robust
and well-performing systems. However, the existing probabilistic
systems are generally based on hidden Markov models
(HMMs). Due to this relatively impoverished representation,
they are unable to take advantage of the wide array of
linguistic information used by many non-probabilistic IE sys- 
tems. In addition, existing HMM-based systems model each 
target category separately, failing to capture relational infor- 
mation, such as typical target order, or the fact that each ele- 
ment only belongs to a single category. This paper shows how 
to incorporate a wide array of knowledge into a probabilistic 
IE system, based on dynamic Bayesian networks (DBN) — a 
rich probabilistic representation that generalizes HMMs. 

Let us illustrate IE by describing seminar announcements,
which have become established as one of the most popular
benchmark domains in the field [Califf and Mooney, 1999;
Freitag and McCallum, 1999; Soderland, 1999;
Roth and Yih, 2001; Ciravegna, 2001]. People receive
dozens of seminar announcements weekly and need to
manually extract information and paste it into personal
organizers. The goal of an IE system is to automatically 
identify target fields such as location and topic of a seminar, 
date and starting time, ending time and speaker. Announce- 
ments come in many formats, but usually follow some 
pattern. We often find a header with a gist in the form
"PostedBy: john@host.domain; Who: Dr.
Steals; When: 1 am;" and so forth. Also, in the
body of the message, the speaker usually precedes both
location and starting time, which in turn precede ending
time, as in: "Dr. Steals presents in Dean
Hall at one am." The task is complicated since
some fields may be missing or may contain multiple values. 

This kind of data falls into the so-called semi-structured 
text category. Instances obey certain structure and usually 
contain information for most of the expected fields in some 
order. There are two other categories: free text and structured 
text. In structured text, the positions of the information fields 
are fixed and values are limited to a pre-defined set.
Consequently, IE systems focus on specifying the delimiters and
order associated with each field. At the opposite end lies the 
task of extracting information from free text which, although 
unstructured, is assumed to be grammatical. Here IE systems 
rely more on syntactic, semantic and discourse knowledge in 
order to assemble relevant information potentially scattered 
all over a large document. 

IE algorithms face different challenges depending on the 
extraction targets and the kind of the text they are embedded 
in. In some cases, the target is uniquely identifiable (single- 
slot), while in others, the targets are linked together in multi- 
slot association frames. For example, a conference schedule 
has several slots for related speaker, topic and time of the pre- 
sentation, while a seminar announcement usually refers to a 
unique event. Sometimes it is necessary to identify each word 
in a target slot, while some benefit may be reaped from par- 
tial identification of the target, such as labeling the beginning 
or end of the slot separately. Many applications involve pro- 
cessing of domain-specific jargon like Internetese — a style of 
writing prevalent in news groups, e-mail messages, bulletin 
boards and online chat rooms. Such documents do not follow
good grammar, spelling or literary style. Often they read
more like stream-of-consciousness ranting in which ASCII art
and pseudo-graphic sketches are used and emphasis is provided
by all-capitals or multiple exclamation marks. As
we exemplify below, syntactic analysers easily fail on such
corpora.

Table 1: Sample phrase and its representation in multiple feature values for ten tokens.

Other examples of IE application domains include job
advertisements [Califf and Mooney, 1999] (RAPIER), executive
succession [Soderland, 1999] (WHISK), restaurant
guides [Muslea et al., 2001] (STALKER), biological
publications [Ray and Craven, 2001], etc. Initial interest in the
subject was stimulated by ARPA's Message Understanding
Conferences (MUC), which put forth challenges such as
parsing newswire articles related to terrorism (see e.g.
Mikheev et al. [1998]). Below we briefly review various IE systems
and approaches, most of which originated from MUC
competitions.

Successful IE involves identifying abstract patterns in the 
way information is presented in text. Consequently, all pre- 
vious work necessarily relies on some set of textual fea- 
tures. The overwhelming majority of existing algorithms 
operate by building and pruning sets of induction rules defined
on these features (SRV, RAPIER, WHISK, (LP)²). There
are many features that are potentially helpful for extracting 
specific fields, e.g. there are tokens and delimiters that signal
the beginning and end of particular types of information.
Consider the example in Table 1, which shows how the
phrase "Doctor Steals presents in Dean Hall 
at one am." is represented through feature values. For 
example, the lemma "am" designates the end of a time field, 
while the semantic feature "Title" signals the speaker, and 
the syntactic category NNP (proper noun) often corresponds 
to speaker or location. Since many researchers use the semi- 
nar announcements domain as a testbed, we have chosen this 
domain in order to have a good basis of comparison. 

One of the systems we compare to (specifically designed 
for single-slot problems) is SRV [Freitag, 1998]. It is built
on three classifiers of text fragments. The first classifier is 
a simple look-up table containing all correct slot-fillers en- 
countered in the training set. The second one computes the 
estimated probability of finding the fragment tokens in a cor- 
rect slot-filler. The last one uses constraints obtained by rule 
induction over predicates like token identity, word length and 
capitalization, and simple semantic features. 

RAPIER [Califf and Mooney, 1999] is fully based on
bottom-up rule induction on the target fragment and a few to- 
kens from its neighborhood. The rules are templates specify- 
ing a list of surrounding items to be matched and potentially, 
a maximal number of tokens for each slot. Rule generation 
begins with the most specific rules matching a slot. Then
rules for identical slots are generalized via pairwise merging,
until no improvement can be made. Rules in RAPIER are
formulated as lexical and semantic constraints and may include
PoS tags. WHISK [Soderland, 1999] uses constraints similar
to RAPIER, but its rules are formulated as regular expressions 
with wild cards for intervening tokens. Thus, WHISK encodes 
a relative, rather than absolute position of tokens with respect 
to the target. This enables modeling long distance dependen- 
cies in the text. WHISK performs well on both single-slot and 
multi-slot extraction tasks. 

Ciravegna [2001] presents yet another rule induction
method, (LP)². He considers several candidate features such
as lemma, lexical and semantic categories and capitalization
to form a set of rules for inserting tags into text. Unlike other
approaches, (LP)² generates separate rules targeting the
beginning and ending of each slot. This allows for more
flexibility in subjecting partially correct extractions to several
refinement stages, also relying on rule induction to introduce
corrections. Emphasizing the relational aspect of the domain,
Roth and Yih [2001] developed a knowledge representation
language that enables efficient feature generation. They used 
the features in a multi-class classifier SNOW-IE to obtain the 
desired set of tags. The resulting method (SNOW-IE) works 
in two stages: the first filters out the irrelevant parts of text, 
while the second identifies the relevant slots. 

Freitag and McCallum [1999] use hidden Markov models
(HMMs). A separate HMM is used for each target slot. No
preprocessing or features are used except for the token identity.
For each hidden state, there is a probability distribution over
tokens encountered as slot-fillers in the training data. Weakly
analogous to templates, hidden state transitions encode
regularities in the slot context. In particular, prefix and suffix
states are used in addition to target and background states to
capture words frequently found in the neighborhood of
targets. Ray and Craven [2001] go one step further by setting
HMM hidden states in a product space of syntactic chunks
and target tags to model the text structure. The success of the
HMM-based approaches demonstrates the viability of
probabilistic methods for this domain. However, they do not take
advantage of the linguistic information used by the other ap- 
proaches. Furthermore, they are limited by using a separate 
HMM for each target slot, rather than extracting data in an 
integrated way. 

The main contribution of this paper is in demonstrating 
how to integrate various aspects of language in a single
probabilistic model, to incrementally build a robust information
extraction system based on a Bayesian network. This
system overcomes the following dilemma. It is tempting to use
a lot of linguistic features in order to account for multiple
aspects of text structure. However, deterministic rule induction
approaches seem vulnerable to the performance of feature
extractors in pre-processing steps. This presents a problem,
since syntactic instruments that have been trained on highly
polished grammatical corpora are particularly unreliable on
weakly grammatical semi-structured text. Furthermore,
incorporating many features complicates the model, which often
has to be learned from sparse data, and this harms the
performance of classifier-based systems.

2 Features 

Our approach is statistical, which generally speaking means 
that learning corresponds to inferring frequencies of events. 
The statistics we collect originate from various sources. Some
statistics reflect regularities of the language itself, while
others correspond to the peculiarities of the domain. With this
in mind, we design features which reflect both aspects. There
is no limitation on the possible set of features. Local
features like part-of-speech, number of characters in the token,
capitalization and membership in a syntactic phrase are quite
customary in IE. In addition, one could obtain such
characteristics of the word as imageability, frequency of use,
familiarity, or even predicates on numerical values. Since there is
no need for features to be local, one might find it useful to
include the frequency of a word in the training corpus or the
number of occurrences in the document. Notice that the same set
of features would work for many domains. This includes semantic
features along with orthographic and syntactic features.

Before we move on to presenting our system for
probabilistic reasoning, let us discuss in some detail the notation
and methods we used in preliminary data processing and feature
extraction. To use the data efficiently, we need to factor the
text into "orthogonal" features. Rather than working with
thousands of listems (generic words¹) in the vocabulary, and
combining their features, we compress the vocabulary by an
order of magnitude by lemmatisation or stemming.
Orthographic and syntactic information is kept in feature variables
with just a few values each.

Tokenization 

Tokenization is the first step of textual data processing.
A token is a minimal part of text which is
treated as a unit in subsequent steps. In our case,
tokenization mostly involves separating punctuation characters
from words. This is particularly non-trivial for
separating a period [Manning and Schutze, 2001], since it
requires identifying sentence boundaries. Consider the
sentence: "Speaker: Dr. Steals, Chief Exec.
of rich.com, worth $10.5 mil."
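
The difficulty with periods can be illustrated with a small sketch. It is not the tokenizer used in our experiments: the abbreviation list and the punctuation sets below are assumptions made for this example. A trailing period is split off only when the token is not a listed abbreviation, and periods inside a token (rich.com, 10.5) are never touched.

```python
# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "exec.", "mil."}
LEADING = "\"'(["
TRAILING = ".,;:!?\"')]"


def tokenize(text):
    """Split on whitespace, then peel punctuation off the edges of each chunk."""
    tokens = []
    for chunk in text.split():
        # Leading punctuation such as quotes or brackets, one token each.
        core = chunk.lstrip(LEADING)
        tokens.extend(chunk[:len(chunk) - len(core)])
        trail = []
        # Peel trailing punctuation, but keep the period of an abbreviation.
        while core and core[-1] in TRAILING:
            if core[-1] == "." and core.lower() in ABBREVIATIONS:
                break
            trail.append(core[-1])
            core = core[:-1]
        if core:
            tokens.append(core)
        tokens.extend(reversed(trail))
    return tokens


print(tokenize("Speaker: Dr. Steals, Chief Exec. of rich.com, worth $10.5 mil."))
```

Note that the final "mil." keeps its period even though it also ends the sentence. Resolving that ambiguity requires sentence boundary detection, which this sketch does not attempt.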

Lemmatisation 

We have developed a simple lemmatiser which combines the
outcome of some standard lemmatisers and stemmers into a
look-up table. Combined with lemmatisation is a step of spell
checking to catch misspelled words. This is done by
interfacing with the UNIX ispell utility.

¹ A word is a sequence of alphabetical characters which has some
meaning assigned to it. This covers words found in general and
special vocabulary, as well as abbreviations, proper names and such.

Gazetteer 

Our original corpus contains about 11,000 different listems.
This does not take into account tokens consisting of
punctuation characters, numbers and such. About 10% are
proper nouns. The question of building a vocabulary
automatically was previously addressed in the IE literature (see e.g.
Riloff [1996]). We use the intersection of two sets. The first
set consists of words encountered as part of target fields and
in their neighborhood. The second set consists of words
frequently seen in the corpus. Aside from the vocabulary, there are
two reserved values for Out-of-Vocabulary (OoV) words and
Not-a-Word (NaW). For example, see the blank slots in the lemma
row of Table 1. The first category encodes rare and
unfamiliar words, which are still identified as words according to
their part of speech. The second category is for mixed
alphanumerical tokens, punctuation and symbolic tokens.
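
The vocabulary construction can be sketched as follows. This is an illustration under stated assumptions rather than our actual procedure: the frequency threshold `min_count` is hypothetical, and a purely alphabetic test stands in for the part-of-speech check that separates words from non-words.

```python
from collections import Counter

OOV, NAW = "OoV", "NaW"   # the two reserved values


def build_vocabulary(corpus_tokens, target_context_tokens, min_count=2):
    """Intersect words seen in or near target fields with frequent words."""
    counts = Counter(t.lower() for t in corpus_tokens if t.isalpha())
    frequent = {w for w, c in counts.items() if c >= min_count}
    near_targets = {t.lower() for t in target_context_tokens if t.isalpha()}
    return frequent & near_targets


def lemma_value(token, vocabulary):
    """Map a token to a vocabulary entry or to one of the reserved values."""
    if not token.isalpha():
        return NAW            # mixed alphanumeric, punctuation, symbols
    word = token.lower()
    return word if word in vocabulary else OOV


corpus = "the seminar is in the hall and the talk is in room 5".split()
vocab = build_vocabulary(corpus, ["in", "hall", "room", "the"])
print(sorted(vocab), lemma_value("hall", vocab), lemma_value("5", vocab))
```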

Syntactic Categories 

We used the LTChunk software from the University of Edinburgh NLP
group [Mikheev et al., 1998]. It produces 47 PoS tags from the
UPenn TreeBank set [Marcus et al., 1994]. We have
clustered these into 7 categories: cardinal numbers (CD), nouns
(NN), proper nouns (NNP), verbs (VB), punctuation (.),
preposition/conjunction (IN) and other (SYM). The choice of
clusters seriously influences the performance, while keeping all
47 tags would lead to large conditional probability tables (CPTs) and sparse data.
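
One possible clustering is sketched below. The exact assignment of the 47 tags to the 7 categories is not spelled out above, so the mapping here is an assumption made for illustration, in particular the choice of which tags fall into IN and SYM.

```python
# Penn Treebank punctuation tags (assumed set for this sketch).
PUNCT_TAGS = {".", ",", ":", "``", "''", "(", ")"}


def cluster_pos(tag):
    """Collapse a Penn Treebank tag into one of seven coarse categories."""
    if tag == "CD":
        return "CD"
    if tag in ("NNP", "NNPS"):
        return "NNP"
    if tag in ("NN", "NNS"):
        return "NN"
    if tag.startswith("VB"):          # VB, VBD, VBG, VBN, VBP, VBZ
        return "VB"
    if tag in PUNCT_TAGS:
        return "."
    if tag in ("IN", "CC", "TO"):     # assumed members of the IN cluster
        return "IN"
    return "SYM"                      # everything else


print([cluster_pos(t) for t in ["NNP", "VBZ", "IN", "CD", "JJ", "."]])
```

A smaller tag set keeps every table conditioned on this feature small: a distribution over 7 values needs far fewer training examples per row than one over 47.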

Syntactic Chunking 

Following Ray and Craven [2001], we obtain syntactic
segments (aka syntactic chunks) by running the Sundance
system [Riloff, 1996] and flattening the output into four
categories corresponding to noun phrase (NP), verb phrase (VP),
prepositional phrase (PP) and other (n/a). Table 1 shows
a sample outcome. Note that both the part-of-speech
tagger and the syntactic chunker easily get confused by the
non-standard capitalization of the word "Presents", as shown by the
incorrect labels in parentheses. "Steals" is incorrectly
identified as a verb, whose subject is "Doctor" and object is
"Presents". Remarkably, other state-of-the-art syntactic
analysis tools [Charniak, 1999; Ratnaparkhi, 1999] also failed on
this problem.

Capitalization and Length 

Simple features like capitalization and length
of word are used by many researchers (e.g.
SRV [Freitag and McCallum, 1999]). The case
representation process is straightforward except for the choice of the
number of categories. We found it useful to introduce an extra
category for words which contain both lower and upper case
letters (not counting the initial capital letter), which tend to
be abbreviations.
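
A minimal sketch of such a case feature is given below. The category names and their number are assumptions for this example; only the extra "mixed" class for tokens with internal capitals follows directly from the description above.

```python
def case_category(token):
    """Return a coarse capitalization class for a token."""
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return "none"                 # no letters at all, e.g. "10.5"
    if all(c.islower() for c in letters):
        return "lower"
    if all(c.isupper() for c in letters):
        return "upper"
    rest = [c for c in token[1:] if c.isalpha()]
    if token[0].isupper() and all(c.islower() for c in rest):
        return "capitalized"          # initial capital only
    return "mixed"                    # e.g. "PhD", often an abbreviation


print([case_category(t) for t in ["hall", "CMU", "Dean", "PhD", "10.5"]])
```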

Semantic Features 

There are several semantic features which play an important role
in a variety of application domains. In particular, it is useful
to be able to recognize what could be a person's name,
geographic location, various parts of an address, etc. For
example, we are using a list of secondary location identifiers
provided by the US Postal Service, which identifies as such words
like hall, wing, floor and auditorium. We also use a list of the
100,000 most popular names from the US Census Bureau; the list
is augmented by rank, which helps to decide in favor of first
or last name for cases like "Alexander". In general, this
task could be helped by using the hypernym feature of the WordNet
project [Fellbaum, 1998]. The next section presents the
probabilistic model which makes use of the aforementioned feature
variables.
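
The list-based semantic lookup described above might look like the following sketch. The rank values are made up for illustration and are not census data, and the class names are hypothetical; the only idea taken from the text is that a lower rank (a more common name) decides between first and last name.

```python
# Made-up ranks for illustration only (1 = most common); not census data.
FIRST_NAME_RANK = {"alexander": 50, "joe": 20}
LAST_NAME_RANK = {"alexander": 100, "steals": 9000}
LOCATION_WORDS = {"hall", "wing", "floor", "auditorium"}


def semantic_feature(token):
    """Return a coarse semantic class for a token using list look-ups."""
    word = token.lower()
    if word in LOCATION_WORDS:
        return "Location"
    first = FIRST_NAME_RANK.get(word)
    last = LAST_NAME_RANK.get(word)
    if first is None and last is None:
        return "n/a"
    # When a word is on both lists, prefer the list where it ranks higher.
    if last is None or (first is not None and first <= last):
        return "FirstName"
    return "LastName"


print([semantic_feature(t) for t in ["Alexander", "Steals", "Hall", "presents"]])
```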

3 BIEN 

We convert the IE problem into a classification problem by 
assuming that each token in the document belongs to a target 
class corresponding to either one of the target tags or the
background (compare to Freitag [1999]). Furthermore, it seems
important not to ignore the information about interdependen- 
cies of target fields and document segments. To combine ad- 
vantages of stochastic models with feature-based reasoning, 
we use a Bayesian network. 

A dynamic Bayesian network (DBN) is ideal for represent- 
ing probabilistic information about these features. Just like a 
Bayesian network, it encodes interdependence among various 
features. In addition, it incorporates the element of time, like 
an HMM, so that time-dependent patterns such as common or- 
ders of fields can be represented. All this is done in a compact 
representation that can be learned from data. We refer to a
recent dissertation [Murphy, 2002] for a good overview of all
aspects of dynamic Bayesian networks.

Each document is considered to be a single stream of
tokens. In our DBN, called the Bayesian Information Extraction
Network (BIEN), the same structure is repeated for every
index. Figure 1 presents the structure of BIEN. This structure
contains state variables and feature variables. The most im- 
portant state variable, for our purposes, is "Tag" which corre- 
sponds to information we are trying to extract. This variable 
classifies each token according to its target information field, 
or has the value "background" if the token does not belong 
to any field. "Last Target" is another hidden variable which 
reflects the order in which target information is found in the 
document. This variable is our way of implementing a mem- 
ory in a "memory-less" Markov model. Its value is determin- 
istically defined by the last non-background value of "Tag" 
variable. Another hidden variable, "Document Segment", is 
introduced to account for differences in patterns between the 
header and the main body of the document. The former is 
close to the structured text format, while the latter to the free 
text. "Document Segment" influences "Tag" and together 
these two influence the set of observable variables which
represent features of the text discussed in Section 2.

Figure 1: A schematic representation of BIEN.

Standard inference algorithms for DBNs are similar to those for HMMs.
In a DBN, some of the variables will typically be observed,
while others will be hidden. The typical inference task is to
determine the probability distribution over the states of a
hidden variable over time, given time series data of the observed
variables. This is usually accomplished using the
forward-backward algorithm. Alternatively, we might want to know
the most likely sequence of hidden variables. This is accom- 
plished using the Viterbi algorithm. Learning the parameters 
of a DBN from data is accomplished using the EM algorithm 
(see e.g. Murphy [2002]). Note that, in principle, parts of
the system could be trained separately on an independent corpus
to improve performance. For example, one could learn in- 
dependently the conditional vocabulary of email/newsgroup 
headers, or learn a probability of part-of-speech conditioned 
on a word, to avoid dependence on external PoS taggers. Also 
prior knowledge about the domain and the language could be 
set in the system this way. The fact that etime almost never 
precedes stime as well as the fact that speaker is never a verb 
could be encoded in a conditional probability table (CPT). In 
large DBNs, exact inference algorithms are intractable, and 
so a variety of approximate methods have been developed. 
However, the number of hidden state variables in our model 
is small enough to allow exact algorithms to work. Indeed, 
all hidden nodes in our model are discrete variables which 
assume just a few values. "DocumentSegment" is binary, with
values in {Header, Body}; "LastTarget" has as many values as
"Tag": one for each of the four target fields plus one for the
background.
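
Because the hidden variables are few and discrete, every joint assignment of them can be treated as a single state, and decoding reduces to the textbook Viterbi recursion. The sketch below shows that recursion for a plain first-order chain; it is not the BIEN implementation, and the two states, the two observation symbols and all probabilities are toy values assumed for the example.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden state sequence for a first-order chain (log-space)."""
    # best[t][s]: best log-probability of any path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))


def logs(table):
    return {k: math.log(v) for k, v in table.items()}


# Toy parameters (assumed values, for illustration only).
states = ["background", "stime"]
log_start = logs({"background": 0.9, "stime": 0.1})
log_trans = {"background": logs({"background": 0.8, "stime": 0.2}),
             "stime": logs({"background": 0.5, "stime": 0.5})}
log_emit = {"background": logs({"word": 0.9, "number": 0.1}),
            "stime": logs({"word": 0.2, "number": 0.8})}

print(viterbi(["word", "number", "word"], states, log_start, log_trans, log_emit))
```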

4 Results 

Several researchers have reported results on the CMU sem- 
inar announcements corpus, which we have chosen in order 
to have a good basis of comparison. The CMU seminar an- 
nouncements corpus consists of 485 documents. Each an- 
nouncement contains some tags for target slots. On average 
starting time appears twice per document, while location and 
speaker 1.5 times, with up to 9 speaker slots and 4 loca- 
tion slots per document. Sometimes multiple instances of the 
same slot differ, e.g. speaker Dr. Steals also appears as
Joe Steals². Ending time, speaker and location are
missing from 48%, 16% and 5% of documents, respectively.

² Obtaining 100% performance on the original corpus is
impossible, since some tags are misplaced and in general the corpus is not
marked uniformly: sometimes secondary occurrences are ignored.

Table 2: F1 performance measure for various IE systems.



In order to demonstrate our method, we have developed a web
site which works with arbitrary seminar announcements and
reveals some semantic tagging. We also make available a list
of errors in the original corpus, along with our new derivative
dataset. We report results using the same measure and test
set as other publications [Roth and Yih, 2001;
Ciravegna, 2001]. The data is split randomly into training
and testing set. The reported results are averaged over five 
runs. Table 2 presents a comparison with numerous previous
attempts at the CMU seminar corpus. The figures are taken
from Roth and Yih [2001]. BIEN performs comparably to the
best system in each category, while notably outperforming
other systems in finding location. This is partly due to the
"LastTarget" variable, which turns out to be
generally useful. Here is the learned conditional probability
table (CPT) for P(Target | LastTarget), where the element
(I, J) corresponds to the probability of getting target tag J after
target tag I was seen. We learn that the initial tag is stime or
speaker with a 2:1 likelihood ratio; etime is naturally the most
likely follower of stime and in turn forecasts location.
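
When every token in the training data carries a tag, a table of this kind can be estimated by simple counting, since "LastTarget" is a deterministic function of the preceding tags. The sketch below makes that simplification explicit: it uses relative frequencies on fully labeled sequences instead of EM, it assumes "LastTarget" starts out as background, and the two short tag sequences are invented for the example.

```python
from collections import Counter, defaultdict

BACKGROUND = "background"


def estimate_cpt(tag_sequences):
    """Relative-frequency estimate of P(Tag | LastTarget) from labeled data."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        last = BACKGROUND                  # assumed initial value
        for tag in tags:
            counts[last][tag] += 1
            if tag != BACKGROUND:          # deterministic LastTarget update
                last = tag
    return {last: {tag: n / sum(row.values()) for tag, n in row.items()}
            for last, row in counts.items()}


# Invented toy sequences, one list of per-token tags per document.
documents = [
    ["background", "stime", "background", "etime"],
    ["speaker", "background", "stime"],
]
cpt = estimate_cpt(documents)
print(cpt["stime"])
```

Prior knowledge, such as etime almost never preceding stime, could be expressed in the same table by fixing or smoothing the corresponding entries before learning.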



The corpus and demo for this paper are available from
http://www.eecs.harvard.edu/~pesha/papers.html

Table 3: F1 performance comparison across implementations
of BIEN with disabled features.

Figure 2: A learning curve for precision and recall with growing
training sample size.

Other variables turn out to be useless, e.g. the number of
characters does not add anything to the performance, and
neither does the initially introduced "SeenTag" variable, which
kept track of all tags seen up to the current position.
Table 3 presents the performance of BIEN with various individual
features turned off. Note that the figures for the complete BIEN are
slightly better than in Table 2, since we pushed the fraction of
the training data to the maximum. Capitalization helps
identify location and speaker, while losing it does not damage
performance drastically. Although information is reflected in
syntactic and semantic features, most names in documents do
not identify a speaker. One would hope to capture all relevant
information by syntactic and semantic categories; however,
BIEN does not fare well without observing "Lemma". Losing
the semantic feature seriously undermines performance in the
location and speaker categories; the ability to recognize names is
rather valuable for many domains.

Reported figures are based on an 80%-20% split of the
corpus. Increasing the size of the training corpus did not
dramatically improve the performance in terms of F measure, as
further illustrated by Figure 2, which presents a learning curve:
precision and recall averaged over all fields, as a function of
training data fraction. Trained on a small sample, BIEN acts
very conservatively, rarely picking fields, and therefore scores
high precision and poor recall. Having seen hundreds of
target field instances and tens of thousands of negative samples,
BIEN learns to generalize, which leads to generous tagging,
i.e. lower precision and higher recall.
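
For reference, the trade-off between precision and recall and its summary in the F1 measure (their harmonic mean) can be computed as in the sketch below. The matching criterion is an assumption of this example: predictions and gold annotations are compared as sets of exact items, which may differ from the scoring used for the reported figures.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted slot fillers against the gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A conservative tagger: few predictions, all correct.
print(precision_recall_f1({"a"}, {"a", "b", "c"}))
# A generous tagger: more predictions, some wrong.
print(precision_recall_f1({"a", "b", "d", "e"}, {"a", "b", "c"}))
```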

So far we have provided results obtained on the original CMU
seminar announcements data, which is not very
challenging. Most documents contain the header section with all the
target fields easily identifiable right after the corresponding
key word. We have created a derivative dataset in which
documents are stripped of headers and two extra fields are
sought: date and topic. Indeed, this corpus turned out to be
more difficult: with our current set of features we obtain only
64% performance on speaker and 68% performance on topic.
Date does not present a challenge except for cases of regular
weekly events or relative dates like "tomorrow". Admittedly,
the bootstrapping test performance is not a guarantee of the
system's performance on novel data, since preliminary processing,
i.e. tokenization and gazetteering, as well as the choice of PoS
tag set, leads to a strong bias towards the training corpus.

5 Discussion 

We have described how to integrate various aspects of lan- 
guage into a single probabilistic model, and to incremen- 
tally build a robust IE system based on a Bayesian net- 
work. Currently, we are working on learning the structure of 
BIEN automatically. It seems to lend itself nicely to
structural EM [Friedman, 1998; Murphy, 2002]. The first step is
automatic selection of relevant features. Another direction
of current work is using approximate inference. We have
tried LBP (loopy belief propagation) [Murphy et al., 1999;
Murphy, 2002], but for the current structure of BIEN it seems
to give no gain. More challenging applications, which
require larger, more strongly connected networks, will benefit from
approximate inference algorithms. These will enable quick
online inference on a network learned off-line with exact
methods, as well as learning for cases where exact
inference is infeasible. One such network will result from
integrating a PoS tagger and other feature extractors into
BIEN. This is a natural extension of BIEN since various
text processing routines are mutually dependent. Consider,
for example, PoS tagging, sentence boundary detection and
named entity recognition. Another complex BIEN
structure will result if we try to better reflect complex relational
information [Califf and Mooney, 1999; Roth and Yih, 2001;
Roth and Yih, 2002], e.g. to process cases like seminar
cancellations and rescheduling, and to handle multi-slot extraction,
e.g. multiple seminar announcements and conference
schedules.